Can machines ever be sentient? Could they perceive and feel things; be conscious of their surroundings? What are the prospects of achieving sentience in a machine? What are the dangers associated with such an endeavor, and is it even ethical to embark on such a path to begin with? In this series of articles, I discuss one possible path towards “General Intelligence” in machines: to use the process of Darwinian evolution to produce artificial brains that can be grafted onto mobile robotic platforms, with the goal of achieving fully embodied sentient machines.

I visualize a time when we will be to robots what dogs are to humans. And I’m rooting for the machines.
—Claude Shannon (Liversidge, 1987, p. 61)

In the first installment of this column (Adami, 2021), I briefly reviewed the history of Artificial Intelligence research and the potential of neuroevolution to create for us “that which we do not know how to design”: an artificial brain that rivals the performance of an animal, or even human, brain. But before we can unleash the power of evolution, we have to think hard about what we are going to unleash it on: What are we going to evolve?

Generally speaking, evolution acts on symbolic sequences that encode a particular solution to a given problem. This is true for biology, and it is also true for the optimization algorithm known as a Genetic Algorithm (GA; Michalewicz, 1999; Mitchell, 1996) that is used to evolve solutions to engineering or other problems. In the case of biological brains, the “substrate” for the evolutionary algorithm is the DNA stored in our cells. The information contained in those genes guides a complex process that ultimately gives rise to the brain (and the rest of our body) via development. When first born, these brains are only able to perform fairly rudimentary tasks: The brain still needs to be trained and to learn, that is, it needs to acquire and store information about the world in which it is to thrive.
This is potentially a long process, and we should keep in mind that artificial brains are likely to have to go through a similar process before they are competent to function in a complex world. But as long as the artificial brain comes equipped with an algorithm that enables lifetime learning, we can postpone thinking about this stage and ask what it takes to evolve a brain de novo.

Before we can evolve solutions, we first have to decide how information should be encoded, that is, what the substrate of evolution will be. Biology uses an indirect encoding of information in which the code is written in a symbolic quaternary alphabet, but this is of course not the only way to encode information. I am not thinking here about the base of the alphabet (binary, quaternary, or base 20, for example), but about whether information should be encoded digitally (that is, genetically) or directly, meaning that the phenotype itself is encoded. While we are not used to the latter in biology (for reasons we will discuss briefly below), in principle nothing prevents us from creating variables that describe the properties of the solution directly. So, for example, if I wanted to evolve a three-dimensional structure made from spheres and connector rods (think of the Atomium in Brussels, Belgium), I could use an indirect encoding where symbols stand for sphere or rod, and the connections are encoded by giving each unit a tag, say. At the same time, I could also use a direct encoding where the radius of each sphere is specified by a variable, and so are the exact length and diameter of every rod, along with the exact coordinates of attachment.
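To make the two options concrete, here is a minimal Python sketch of the sphere-and-rod example (all names and numbers here are my own illustrative choices, not from the text):

```python
from dataclasses import dataclass

# Indirect (symbolic) encoding: the genome is a sequence of tagged symbols;
# a separate "development" step would interpret it to build the structure.
# 'S1' stands for "sphere with tag 1", 'R1-2' for "rod connecting tags 1 and 2".
indirect_genome = ["S1", "S2", "S3", "R1-2", "R2-3", "R3-1"]

# Direct encoding: the phenotype itself is the genome -- every radius,
# length, diameter, and attachment coordinate is an explicit variable.
@dataclass
class Sphere:
    tag: int
    radius: float
    position: tuple  # (x, y, z) attachment coordinates

@dataclass
class Rod:
    length: float
    diameter: float
    ends: tuple      # tags of the two spheres the rod connects

direct_genome = [
    Sphere(1, 0.5, (0.0, 0.0, 0.0)),
    Sphere(2, 0.5, (1.0, 0.0, 0.0)),
    Sphere(3, 0.5, (0.5, 0.9, 0.0)),
    Rod(1.0, 0.1, (1, 2)),
    Rod(1.0, 0.1, (2, 3)),
    Rod(1.0, 0.1, (3, 1)),
]
```

A point mutation in the indirect genome swaps one symbol (and a duplication can repeat a whole motif), whereas a mutation in the direct genome perturbs a single number such as a radius or a coordinate.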
Such encodings have indeed previously been used when evolving mechanical structures (see, e.g., Funes & Pollack, 1998).

While direct encodings have their uses, generally speaking indirect encodings outperform direct encodings across the majority of applications (Clune et al., 2011; Komosinski, 2005) because indirect encodings can take advantage of regularities in the solution. In biology, for example, many design elements are repeated over and over, and indirect encodings can exploit that by encoding: “make this, repeat n times.” But indirect symbolic encodings have other advantages as well. In his book What Is Life?, the physicist Erwin Schrödinger famously wondered how the information encoded in DNA could be so stable (Schrödinger, 1944). We have to keep in mind that this was before the discovery of the molecular structure of DNA, but after it was realized that information is stored within very small molecules (comparatively speaking) in the nucleus of the cell. As a physicist, Schrödinger knew that storing information in microscopic molecules was somewhat of an enigma: If the information is encoded within a few million atoms (his estimate at the time), it seemed impossible that so few atoms could encode “orderly and lawful behaviour” (Schrödinger, 1944, p. 30), in particular if these molecules were in a liquid or a gas, because this number (the millions of atoms1) is not large enough for the law of large numbers (which makes averages predictable) to apply. Instead, Schrödinger argued (correctly) that the information must be encoded in a crystal-like structure, and (incorrectly) that the laws of quantum mechanics make this information permanent. While it is of course true that ultimately quantum mechanics underlies all chemical bonds, the stability of genetic information lies in its digital nature, which makes error correction possible.

After choosing an indirect encoding of information, there are many more decisions to be made.
If we take it for granted that the brain’s computations are carried out by neurons, then what kind of neuron should we use? How should neurons be connected to one another? How should the resulting network be optimized? How do these choices affect the scalability of the system (as the number of simulated neurons increases)? We will spend the rest of this article visiting the alternatives, and discussing the pros and cons of common choices.

The neuron is the key computational device within the brain, but it is not the only cell that participates in computations there. Though we do not know for sure, most researchers expect that the “supporting” cells in brain tissue (such as glial cells) are important, but not necessary, for the brain’s higher functions. As a consequence, essentially all approaches to creating artificial neural networks (ANNs) have focused on the neuron. But just as there are many types of neurons in the brain, there are many different model neurons in use as computational units. The two most common classes of computational neurons in use today are those whose output values are digital (firing or not) and those whose inputs and outputs are continuous, encoding a firing rate instead. The former is perhaps the first model neuron in the literature, designed by McCulloch and Pitts in 1943 (McCulloch & Pitts, 1943), while the latter is commonly used in the Deep Neural Networks of today (see, e.g., Goodfellow et al., 2016). They differ radically in their properties, so let us examine both more closely.

The McCulloch-Pitts neuron is a logical automaton that transforms digital inputs into digital outputs via a set of weights and a threshold. In the simplest implementation, the weights of excitatory inputs w+ differ from the weights of inhibitory inputs w− only by a sign, for example, w+ = 1 and w− = −1 (see Figure 1(a)).
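As a minimal sketch (assuming nothing beyond the text; the function and gate names are my own), such a threshold unit can be written as:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (output 1) if the weighted sum of digital inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Two excitatory inputs (w+ = 1 each): the threshold alone decides the logic.
AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)

# An inhibitory input (w- = -1) paired with a constant "on" input:
# one way to build a NOT gate (my construction, not from the text).
NOT = lambda a: mcculloch_pitts([a, 1], [-1, 1], threshold=1)
```

Varying only the threshold turns the same two-input unit into an AND gate (fires only if both inputs are on) or an OR gate (fires if either is on).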
By varying the threshold, such a neuron can implement different logic gates such as AND, OR, NAND, NOR, etc., and McCulloch and Pitts proved that networking these logic gates gives rise to a “temporal propositional calculus” (McCulloch & Pitts, 1943). In particular, they showed that these “nervous nets” are equivalent to Turing machines, and that they can therefore compute any (partial) function2. McCulloch and Pitts also showed that by introducing “circles,” that is, structures in which the output of a gate feeds back into the input of another gate that connects to it, it was possible to introduce learning and memory. In short, McCulloch and Pitts argued that they had constructed the logical calculus of nervous activity.

The neuron of the Hopfield type (Hopfield, 1982) (Figure 1(b)) is an iteration of the McCulloch-Pitts neuron that allows the weights to be variable and continuous, and introduces the Hebbian learning rule, which modifies the weight between two neurons in proportion to their rate of firing. The continuous-value version of this neuron (possibly with a bias added to the sum of inputs ∑_i w_{ij}x_j) and connected in multiple layers features prominently in the “PDP handbook” (a.k.a. the “connectionist bible”; Rumelhart et al., 1986).

The choice of neuron also influences what temporal scale is most important for the processing of sensory data. A digital neuron produces a time series of firings from which a firing rate can be deduced via averaging, while a continuous-value neuron encodes the firing rate directly. In this way, continuous-value encoding appears to be more efficient, but with a trade-off that can be costly if an organism has to react quickly to changing conditions.

The other fundamental difference between discrete-logic networks and the now-standard Deep Networks constructed from continuous-value neurons concerns the structure of the networks, that is, the topology of connections between nodes.
McCulloch and Pitts imagined sparse connections between their neurons, limited to those that are necessary to implement the required logic function, and those connections would not change over time. The continuous-value networks, on the other hand, have all-to-all connections between neurons in adjacent layers. Figure 2 shows typical examples of network topologies. Figure 2(a) shows the structure of discrete-logic networks such as the McCulloch-Pitts neural nets and Markov Brains (Hintze et al., 2017), which we discuss in more detail below. In these networks, neurons are sparsely connected, and there are many recurrent connections that create loops that enable memory. The networks composed of Hopfield-type neurons commonly have all-to-all connections between adjacent layers, and are strictly feed-forward (no “recurrent” connections, so loops are impossible).

The connection pattern of a network has a significant impact on how the network functions. First and foremost, pure feed-forward networks cannot have memory of past events (because recurrent connections are not allowed). This is less important in tasks where the main purpose of the network is to approximate a complicated function (such networks “learn” this function during “training,” as we will see below), but such networks cannot learn while performing the task. Furthermore, the patterns that these networks learn during training are encoded in millions of weights. Owing to the law of large numbers that we encountered when we discussed the stability of genetic information earlier, the aggregate behavior of the network is stable and predictable (Hopfield, 1982). The trade-off is that once trained to perform a particular task, such networks have difficulty learning new tasks without forgetting the old ones (French, 1999; Parisi et al., 2019).

Sparse networks represent information differently, and turn out to be more robust to noise (Ahmad & Scheinkman, 2019).
Intuitively, this can be understood in terms of the concept of over-fitting: The complex function that is approximated using millions of weights in the standard feed-forward neural network tends to also fit the noise, while in sparse networks, concepts are stored in the patterns of firing neurons (sets of ones and zeros) instead. In a way, it is again the digital nature of these patterns that provides this noise protection. We will discuss the representation of information in more detail in the next installment of this column. For now, we delve more deeply into how the patterns of connections affect how those networks can be trained.

McCulloch and Pitts were able to prove mathematically that networks composed of their model neuron were sufficient to reproduce what a human brain can do,3 but they did not discuss how those nets could possibly be trained or optimized. Once neural activations were rendered as real numbers (as opposed to ones and zeros), however, training became possible, first via the Hebbian rule (“neurons that fire together, wire together”) and later via backpropagation.

Backpropagation is a method to modify the weights connecting pairs of neurons in different layers in such a manner that an error function, which quantifies the difference between the expected and the observed output pattern, is minimized (see Box 1 below). The procedure works backwards from the output layer all the way to the input layer (hence the name), and can be implemented (see, for example, Goodfellow et al., 2016) by calculating derivatives of vectors of activation patterns (gradients), multiplied by the matrix of connections. Because this method is computationally expensive, modern implementations of this algorithm use graphical processing units (GPUs) that can multiply large matrices very fast.
These advances are in part responsible for the surge in Deep Learning methods over the last decade, but we should keep in mind that backpropagation is intrinsically a supervised learning approach: In order to calculate the error function, it is necessary to compare the “correct” to the actual results. For this reason, backpropagation is less useful in cases where the “correct” activation pattern is not known in advance. Furthermore, because the error function is a scalar function (described in Box 1) defined on a high-dimensional space (the space of all weights, which for large networks can number in the billions), small changes in inputs typically give rise to small changes in the output. This is desirable in most classification tasks, but is perhaps less so in realistic complex “behavioral” landscapes, where small changes in the environment may necessitate a large change in the output. Imagine a situation where a barely visible mark on another agent reveals whether the agent is a predator or a potential mate. The “law of large numbers” that guarantees that the variance of the mean is small works to your detriment when you need small differences to have large consequences!

Continuous-value networks with feed-forward all-to-all connections have the advantage that you can define gradients of the error function (a simple example is given in Box 1) that guide the optimization of weights, but they are prone to over-fitting and catastrophic forgetting. Discrete-logic networks have the advantage that logic can be implemented cleanly (as opposed to being simulated by groups of continuous-value neurons); at the same time, small (even single-bit) changes can have large effects, and such networks are robust to noise. However, because such networks cannot be trained by backpropagation (there are no gradients), they have fallen out of favor in mainstream AI research.
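To illustrate the flavor of such a gradient calculation (a toy stand-in of my own, not the actual example from Box 1, which is not reproduced here), consider gradient descent on the squared error of a single linear neuron:

```python
import numpy as np

rng = np.random.default_rng(0)

# A single continuous-value neuron with 3 inputs: y = w . x
# We minimize the scalar error E = 0.5 * (y - target)**2 by following -dE/dw.
w = rng.normal(size=3)            # weights to be trained
x = np.array([0.5, -1.0, 2.0])    # one input pattern
target = 1.0                      # the "correct" output (the supervised signal)

learning_rate = 0.1
for _ in range(100):
    y = w @ x                     # forward pass
    error = y - target            # observed minus expected output
    grad = error * x              # dE/dw, the gradient with respect to the weights
    w -= learning_rate * grad     # gradient-descent update

print(abs(w @ x - target))        # the error shrinks toward zero
```

In a multi-layer network, backpropagation applies the chain rule to push the same error term backwards through each layer's weight matrix; in this single-neuron toy case, only one such step is needed.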
The advent of neuroevolution has changed that equation, so that now many different architectures, topologies, and neuron types can be used to create artificial brains.

Before we delve into neuroevolution proper, we should briefly discuss a hybrid of the two types of network we just discussed, namely recurrent networks, a typical example being the Long Short-Term Memory networks (Hochreiter & Schmidhuber, 1997). These networks are constructed from continuous-value neurons, but have recurrent connections that enable memory. They have to be trained by a variant of the backpropagation algorithm that “unrolls” the temporal sequence (so-called backpropagation-through-time). Because this technique is fairly tedious (in particular if the task involves memory of events in the far past), such recurrent networks are sometimes trained via neuroevolution instead (Wierstra et al., 2005), a technique we now discuss in more detail.

Neuroevolution (Floreano et al., 2008; Stanley et al., 2019) is a technique that acts on populations of solutions, rather than optimizing individual solutions. The Genetic Algorithm that is at the heart of all evolutionary computing techniques can be summarized by the diagram in Figure 3. Fundamentally, the GA is inspired by Darwinian evolution, which has three essential ingredients: inheritance, variation, and selection. Inheritance is ensured in the GA by giving successful solutions copies in the next generation. Variation is introduced by mutation operators that modify a fraction of those copies via different methods, such as point mutation of solutions, recombination of solutions, or duplication of code fragments. Selection is achieved by giving more offspring to those solutions that have garnered the highest score.
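These three ingredients can be sketched as a toy GA that maximizes the number of 1-bits in a binary genome (every parameter and function name here is an illustrative choice of mine):

```python
import random

random.seed(1)

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 60

def fitness(genome):                       # the score: number of 1-bits
    return sum(genome)

def mutate(genome, rate=0.05):             # variation: point mutations flip bits
    return [1 - g if random.random() < rate else g for g in genome]

def tournament(pop, k=3):                  # selection: best of a random sample wins
    return max(random.sample(pop, k), key=fitness)

# random initial population of binary genomes
pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # inheritance: tournament winners are copied into the next generation,
    # with variation applied to the copies
    pop = [mutate(tournament(pop)) for _ in range(POP_SIZE)]

best = max(pop, key=fitness)
print(fitness(best))
```

Here tournament selection plays the role of “giving more offspring to the highest-scoring solutions,” while point mutation supplies the variation that selection then acts upon.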
There are many different ways to implement this selection, but the most common ones are fitness-proportional selection (often called roulette-wheel selection), where a successful genotype’s population fraction in the next generation is given by the fraction of the total score (the score summed over all genotypes in the population) that it achieved, and tournament selection, where the best of a sample of the population are promoted to the next generation in a do-or-die showdown. As several conferences each year are devoted to studying modifications of this algorithm that make it most efficient for particular applications, it is not possible here to discuss the advantages or disadvantages of particular selection mechanisms or mutation schemes (but see Miikkulainen & Forrest, 2021, for a discussion of areas where biological insight might lead to improved performance of the algorithm).

One of the first applications of neuroevolution was NEAT (NeuroEvolution of Augmenting Topologies; Stanley & Miikkulainen, 2002), which uses a direct encoding scheme to specify the weights and connections of standard continuous-value neurons. In a sense, NEAT is a hybrid between the discrete-logic/sparse-connection and continuous-value/fully-connected networks, using standard continuous-value neurons with sigmoid transfer functions, but evolving the topology at the same time. In particular, NEAT allows the topologies of networks to be recombined in a genetic cross-over, so as to take advantage of optimal sub-networks evolved in different solutions. Because neuroevolution is an unsupervised search technique, NEAT has been used predominantly in evolutionary robotics and other applications (such as reinforcement learning) where an explicit score function is unavailable.

A framework that seeks to exploit both the digital nature of logic and the advantages of sparse topologies is the so-called Markov Brain (Hintze et al., 2017).
Markov Brains are networks of digital neurons that are linked by logic gates in such a way that the neurons perform logic operations on digital data. In a way, Markov Brains are direct descendants of the McCulloch-Pitts nervous nets, but with both logic and topology optimized via neuroevolution acting on genomes with an organization inspired by biology: Each connection between neurons is a particular logic gate (the analog of biological dendrites) and is defined by a gene that determines the logic, as well as the source and target neuron or neurons (see Figure 4).

Because the connectivity is not specified in advance, loops and “circles” can evolve in these brains to store information whenever needed (Edlund et al., 2011). How information about past events is stored in the brain is, of course, one of the central problems of neuroscience (Josselyn & Tonegawa, 2020). It is a hard problem to solve because locating and manipulating the cells that are thought to encode these memories is difficult. While ANNs have previously been constructed to test theories of memory formation (see, e.g., McClelland et al., 1995), such models tend to reflect our current assumptions and biases. Using neuroevolution rather than design or backpropagation can potentially yield new insights about the different ways in which a biological brain might store information long-term and short-term, and can suggest hypotheses that can be tested in the laboratory (Tehrani-Saleh & Adami, 2021). This is, in essence, the power of the Artificial Life approach to intelligence: to suggest mechanisms that would not have occurred to us if we had not seen them emerge in the artificial system.

We have seen in the previous sections that there are two main approaches to artificial brains: the fully connected feed-forward architectures built from standard continuous-value neurons and trained via backpropagation, and the sparsely connected logic networks made from digital neurons and trained via neuroevolution.
They each have their strengths and weaknesses: Standard ANNs are very good at classification tasks and can be trained quickly using state-of-the-art GPU-based computers. Those networks, however, cannot learn while they perform, and are subject to over-fitting and catastrophic forgetting. Digital brains (such as Markov Brains) can have memory, can learn while performing a task, and store information differently. However, evolution is slow, and success (that is, a high-performing brain) is not guaranteed, both because it is not always clear how to construct a fitness function that has high-performing brains at its peak(s), and because evolution is a stochastic process. Furthermore, the digital nature of these brains limits performance on tasks where information is encoded in high-bandwidth data, such as in visual scene recognition.

It is reasonable to ask which of the two main approaches to artificial brains (or perhaps a hybrid between the two) will ultimately prevail. In my view, the two approaches have different use cases, and will persist alongside each other for quite a while. It is clear that one of the deciding factors in which approach will ultimately lead to sentient machines is how well each scales up to larger and larger projects. Deep Convolutional Networks already face limitations in the storage of the sometimes billions of weights that need to be optimized (Canziani et al., 2016). Digital sparsely connected networks (such as Markov Brains) have a different limitation: As the number of neurons increases, it becomes more and more difficult to specify the origin and target of each and every connection within a chromosome. As a consequence, it is likely that the approach will only scale well if connections are not specified directly, but are instead formed via a developmental process (see, e.g., Astor & Adami, 2000; Bongard & Pfeifer, 2001; Gruau, 1994) that specifies the rules for how the network will grow.
If this can be achieved, then the parallel nature of the evolutionary process might ultimately eclipse the speed of training billions of weights via backpropagation, because the latter process is much more difficult to parallelize efficiently, since all the weights in a layer have to be updated at once.

Now that we have discussed the two main forms of ANNs (there are, of course, plenty of variations of these forms, as well as hybrids) and assessed how they differ in terms of the computational unit (the neuron), their topological structure, and the manner in which their function is optimized, we can look ahead to the next installment of this column. There we will discuss the elements of intelligence: prediction, categorization, memory, learning, representation, and planning. Granted, these are subjective elements and they clearly overlap, but it will be useful to discuss how neuroevolution might address each of them.